In 2020, the entire world was shaken by the SARS-CoV-2 pandemic. The first case was reported in China in mid-December 2019; an animal and seafood market in Wuhan, China, has been indicated as the source of the disease. Many countries introduced a state of emergency to limit the spread of the virus. Unfortunately, due to the many interstate connections, the virus spread very quickly across continents.
The SARS-CoV-2 virus causes the COVID-19 disease, whose symptoms are very similar to those of the seasonal flu. The virus affects the respiratory organs, mainly the lungs. The disease is most dangerous for the elderly and for people with so-called concomitant diseases (e.g. diabetes, lung diseases, cardiovascular diseases). It may lead to complications, e.g. pneumonia or acute respiratory distress syndrome.
To date (November 2020) there is no vaccine or effective antiviral drug. Treatment is symptomatic, with supportive therapy in the case of respiratory failure. To counteract the spread of the disease, frequent hand washing and surface disinfection are recommended.
The economy was significantly affected by the pandemic. Many countries decided to implement a so-called lockdown, a temporary shutdown of the economy, in order to avoid gatherings and further infections. The education system also suffered: online learning was introduced in many schools and universities.
The following document is a coronavirus case study based on this article, published on May 14, 2020. The data concern 375 patients of Tongji Hospital in Wuhan, China. In the cited analysis, the biomarkers most discriminative of patient mortality were identified using machine learning tools. The problem was defined as a classification task, where the input data were blood samples and laboratory test results.
The data set is imported from a flat file.
cov_cs_df <- read_excel("data\\wuhan_blood_sample_data_Jan_Feb_2020.xlsx")
The dataset has 6120 rows and 81 columns, too many to show them all. Below are just a few of the columns:
kable(head(cov_cs_df[1:10], 5))
| PATIENT_ID | RE_DATE | age | gender | Admission time | Discharge time | outcome | Hypersensitive cardiac troponinI | hemoglobin | Serum chloride |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2020-01-31 01:09:00 | 73 | 1 | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | 0 | NA | NA | NA |
| NA | 2020-01-31 01:25:00 | 73 | 1 | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | 0 | NA | 136 | NA |
| NA | 2020-01-31 01:44:00 | 73 | 1 | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | 0 | NA | NA | 103.1 |
| NA | 2020-01-31 01:45:00 | 73 | 1 | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | 0 | NA | NA | NA |
| NA | 2020-01-31 01:56:00 | 73 | 1 | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | 0 | 19.9 | NA | NA |
This shows that the dataset is hard to read. First, we need to clean the data to enhance its readability.
data_df <- cov_cs_df %>%
mutate(gender = as.factor(ifelse(gender==1, "male", "female"))) %>%
mutate(outcome = as.factor(ifelse(outcome == 0, "survived", "died"))) %>%
rename(admission_time = 'Admission time',
discharge_time = 'Discharge time')
colnames(data_df)[34] <- "Tumor_necrosis_factor_alfa"
colnames(data_df)[37] <- "Interleukin_1_Beta"
colnames(data_df)[68] <- "gamma_glutamyl_transpeptidase"
patients_df <- data_df %>%
select(PATIENT_ID, age, gender, admission_time, discharge_time, outcome) %>%
drop_na(PATIENT_ID) %>%
mutate("hospitalization_length" = seconds_to_period(difftime(discharge_time,
admission_time,
units = "days" ))) %>%
relocate(hospitalization_length, .after = discharge_time)
data_df <- data_df %>%
fill(PATIENT_ID)
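The hospitalization_length column above is derived with difftime. As a quick base-R illustration, the stay of the first patient can be computed from the admission and discharge timestamps shown in the table (the UTC time zone is an assumption made only for this example):

```r
# Sketch: length of stay for the first patient, computed as in the
# pipeline above (tz = "UTC" is assumed here for reproducibility).
admission <- as.POSIXct("2020-01-30 22:12:47", tz = "UTC")
discharge <- as.POSIXct("2020-02-17 12:40:09", tz = "UTC")
stay_days <- as.numeric(difftime(discharge, admission, units = "days"))
stay_days  # roughly 17.6 days in hospital
```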
After filling the NA values in the PATIENT_ID column, the dataset looks as below.
kable(head(data_df[1:8], 10))
| PATIENT_ID | RE_DATE | age | gender | admission_time | discharge_time | outcome | Hypersensitive cardiac troponinI |
|---|---|---|---|---|---|---|---|
| 1 | 2020-01-31 01:09:00 | 73 | male | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | survived | NA |
| 1 | 2020-01-31 01:25:00 | 73 | male | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | survived | NA |
| 1 | 2020-01-31 01:44:00 | 73 | male | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | survived | NA |
| 1 | 2020-01-31 01:45:00 | 73 | male | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | survived | NA |
| 1 | 2020-01-31 01:56:00 | 73 | male | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | survived | 19.9 |
| 1 | 2020-01-31 01:59:00 | 73 | male | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | survived | NA |
| 1 | 2020-01-31 02:09:00 | 73 | male | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | survived | NA |
| 1 | 2020-01-31 06:44:00 | 73 | male | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | survived | NA |
| 1 | 2020-02-04 19:42:00 | 73 | male | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | survived | NA |
| 1 | 2020-02-06 09:14:00 | 73 | male | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | survived | NA |
We know that most of the patients had multiple blood samples taken throughout their hospital stay. To prepare a model predicting patient mortality, the next step of the cleaning process is to extract only one observation per patient. Only the final sample data will be used for model training and testing.
data_df <- data_df %>%
group_by(PATIENT_ID) %>%
fill(colnames(data_df)) %>%
summarise(across(everything(), last), .groups="drop")
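The fill-then-keep-last step can be mimicked in base R on a toy data frame (the values below are made up purely for illustration): within each patient, the last non-missing value of every column is kept.

```r
# Toy sketch (base R) of the "carry forward, then keep the last row" idea:
# for each patient, retain the most recent non-missing value per column.
last_filled <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) NA else x[length(x)]
}
toy <- data.frame(id = c(1, 1, 1, 2, 2),
                  hemoglobin = c(NA, 136, NA, 120, NA),
                  chloride   = c(103.1, NA, 99.8, NA, 101.0))
final <- aggregate(toy[-1], by = list(PATIENT_ID = toy$id), FUN = last_filled)
final  # one row per patient, holding each patient's last recorded values
```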
Again, you can see that there are NA values in some columns, which need to be replaced. It is possible to do this in many ways; here, the NA values in each column will be replaced by the median of that column's values.
for(i in 8:ncol(data_df))
{
val <- data_df[[i]]
val[is.na(val)] <- median(val, na.rm = TRUE)
data_df[[i]] <- val
}
After that, the dataset is clean and looks like below.
kable(head(data_df[1:8], 10))
| PATIENT_ID | RE_DATE | age | gender | admission_time | discharge_time | outcome | Hypersensitive cardiac troponinI |
|---|---|---|---|---|---|---|---|
| 1 | 2020-02-17 08:31:00 | 73 | male | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | survived | 19.9 |
| 2 | 2020-02-17 15:34:00 | 61 | male | 2020-02-04 21:39:03 | 2020-02-19 12:59:01 | survived | 1.9 |
| 3 | 2020-02-06 23:15:00 | 70 | female | 2020-01-23 10:59:36 | 2020-02-08 17:52:31 | survived | 12.3 |
| 4 | 2020-02-17 08:31:00 | 74 | male | 2020-01-31 23:03:59 | 2020-02-18 12:59:12 | survived | 4.8 |
| 5 | 2020-02-18 09:35:00 | 29 | female | 2020-02-01 20:59:54 | 2020-02-18 10:33:06 | survived | 5.6 |
| 6 | 2020-02-04 13:59:00 | 81 | female | 2020-01-24 10:47:10 | 2020-02-07 09:06:58 | survived | 19.7 |
| 7 | 2020-02-06 23:14:00 | 40 | male | 2020-01-20 17:54:43 | 2020-02-08 16:21:46 | survived | 2.4 |
| 8 | 2020-02-17 13:21:00 | 47 | male | 2020-01-30 13:42:08 | 2020-02-18 11:45:27 | survived | 3.6 |
| 9 | 2020-02-17 21:59:00 | 38 | female | 2020-01-28 23:06:46 | 2020-02-18 16:46:01 | survived | 12.3 |
| 10 | 2020-02-17 08:53:00 | 30 | male | 2020-02-01 19:50:17 | 2020-02-19 12:12:57 | survived | 3.3 |
Below is a short summary of the clean dataset.
tbl_summary(
data_df,
by = outcome,
label = gender ~ "Gender") %>%
add_n() %>%
modify_header(label = "") %>%
add_overall() %>%
bold_labels()
| | Overall, N = 375¹ | N | died, N = 174¹ | survived, N = 201¹ |
|---|---|---|---|---|
| PATIENT_ID | 188 (94, 282) | 375 | 288 (245, 332) | 101 (51, 151) |
| age | 62 (46, 70) | 375 | 69 (62, 77) | 51 (37, 62) |
| Gender | | 375 | | |
| female | 151 (40%) | | 48 (28%) | 103 (51%) |
| male | 224 (60%) | | 126 (72%) | 98 (49%) |
| Hypersensitive cardiac troponinI | 12 (4, 32) | 375 | 34 (12, 254) | 5 (2, 12) |
| hemoglobin | 125 (113, 137) | 375 | 122 (109, 137) | 126 (117, 137) |
| Serum chloride | 102 (100, 105) | 375 | 103 (100, 111) | 102 (100, 103) |
| Prothrombin time | 14.3 (13.4, 16.2) | 375 | 16.3 (14.9, 18.8) | 13.6 (13.1, 14.1) |
| procalcitonin | 0.10 (0.04, 0.31) | 375 | 0.32 (0.11, 0.93) | 0.04 (0.02, 0.10) |
| eosinophils(%) | 0.25 (0.00, 1.40) | 375 | 0.00 (0.00, 0.10) | 1.30 (0.50, 2.10) |
| Interleukin 2 receptor | 664 (594, 752) | 375 | 664 (664, 1,182) | 664 (460, 664) |
| Alkaline phosphatase | 71 (54, 96) | 375 | 91 (68, 129) | 61 (50, 75) |
| albumin | 33.2 (28.6, 37.5) | 375 | 28.1 (24.6, 31.7) | 36.9 (33.4, 39.3) |
| basophil(%) | 0.20 (0.10, 0.40) | 375 | 0.10 (0.10, 0.20) | 0.30 (0.20, 0.50) |
| Interleukin 10 | 5.2 (5.0, 6.6) | 375 | 5.2 (5.2, 11.8) | 5.2 (5.0, 5.2) |
| Total bilirubin | 11 (7, 16) | 375 | 13 (10, 22) | 8 (6, 12) |
| Platelet count | 192 (118, 256) | 375 | 114 (59, 184) | 238 (192, 297) |
| monocytes(%) | 6.2 (3.2, 8.7) | 375 | 3.0 (2.0, 5.4) | 8.3 (6.9, 10.1) |
| antithrombin | 87 (86, 88) | 375 | 87 (76, 87) | 87 (87, 93) |
| Interleukin 8 | 15 (13, 18) | 375 | 15 (15, 33) | 15 (8, 15) |
| indirect bilirubin | 5.3 (3.9, 7.8) | 375 | 5.5 (4.3, 8.5) | 5.1 (3.7, 7.2) |
| Red blood cell distribution width | 12.75 (12.10, 13.70) | 375 | 13.20 (12.62, 14.60) | 12.40 (11.90, 13.00) |
| neutrophils(%) | 78 (63, 92) | 375 | 92 (87, 95) | 64 (55, 72) |
| total protein | 66 (62, 70) | 375 | 63 (58, 68) | 68 (65, 71) |
| Quantification of Treponema pallidum antibodies | 0.050 (0.040, 0.060) | 375 | 0.050 (0.050, 0.060) | 0.050 (0.040, 0.060) |
| Prothrombin activity | 86 (68, 97) | 375 | 67 (53, 80) | 94 (87, 103) |
| HBsAg | 0.010 (0.000, 0.010) | 375 | 0.010 (0.010, 0.010) | 0.010 (0.000, 0.010) |
| mean corpuscular volume | 90.4 (87.2, 93.8) | 375 | 90.4 (87.6, 96.0) | 90.3 (87.0, 92.5) |
| hematocrit | 36.3 (33.4, 39.9) | 375 | 35.9 (32.4, 39.9) | 36.9 (34.1, 39.9) |
| White blood cell count | 8 (5, 13) | 375 | 12 (8, 17) | 6 (5, 8) |
| Tumor_necrosis_factor_alfa | 8.3 (7.7, 8.8) | 375 | 8.3 (8.3, 11.4) | 8.3 (6.8, 8.3) |
| mean corpuscular hemoglobin concentration | 342 (332, 349) | 375 | 342 (329, 349) | 342 (334, 349) |
| fibrinogen | 4.22 (3.55, 5.10) | 375 | 4.22 (3.01, 5.28) | 4.22 (3.64, 5.02) |
| Interleukin_1_Beta | 5.00 (5.00, 5.00) | 375 | 5.00 (5.00, 5.00) | 5.00 (5.00, 5.00) |
| Urea | 5 (4, 11) | 375 | 11 (6, 18) | 4 (3, 5) |
| lymphocyte count | 0.99 (0.54, 1.52) | 375 | 0.52 (0.32, 0.80) | 1.46 (1.08, 1.80) |
| PH value | 6.00 (6.00, 6.50) | 375 | 6.00 (6.00, 6.50) | 6.00 (6.00, 6.50) |
| Red blood cell count | 4.10 (3.58, 4.62) | 375 | 4.10 (3.54, 4.60) | 4.11 (3.61, 4.63) |
| Eosinophil count | 0.02 (0.00, 0.08) | 375 | 0.00 (0.00, 0.01) | 0.07 (0.03, 0.11) |
| Corrected calcium | 2.37 (2.26, 2.44) | 375 | 2.35 (2.25, 2.42) | 2.38 (2.28, 2.45) |
| Serum potassium | 4.43 (4.06, 4.79) | 375 | 4.43 (3.93, 5.04) | 4.43 (4.17, 4.68) |
| glucose | 6.5 (5.2, 9.4) | 375 | 8.7 (6.7, 12.4) | 5.3 (4.8, 6.5) |
| neutrophils count | 5 (3, 11) | 375 | 11 (7, 15) | 4 (3, 5) |
| Direct bilirubin | 5 (3, 7) | 375 | 7 (5, 12) | 3 (3, 5) |
| Mean platelet volume | 10.80 (10.20, 11.60) | 375 | 11.30 (10.80, 12.30) | 10.40 (9.90, 10.90) |
| ferritin | 760 (621, 856) | 375 | 760 (760, 1,425) | 760 (417, 760) |
| RBC distribution width SD | 41.2 (38.9, 44.6) | 375 | 43.1 (40.4, 48.6) | 40.3 (38.0, 42.1) |
| Thrombin time | 16.55 (15.80, 17.40) | 375 | 16.55 (16.12, 18.58) | 16.55 (15.70, 16.90) |
| (%)lymphocyte | 14 (5, 27) | 375 | 4 (2, 9) | 25 (18, 33) |
| HCV antibody quantification | 0.06 (0.05, 0.08) | 375 | 0.06 (0.05, 0.09) | 0.06 (0.05, 0.07) |
| D-D dimer | 1 (1, 9) | 375 | 11 (2, 21) | 1 (0, 1) |
| Total cholesterol | 3.72 (2.99, 4.36) | 375 | 3.20 (2.61, 3.72) | 4.24 (3.69, 4.69) |
| aspartate aminotransferase | 25 (19, 40) | 375 | 40 (25, 64) | 20 (16, 25) |
| Uric acid | 260 (203, 342) | 375 | 260 (190, 381) | 260 (206, 325) |
| HCO3- | 23.9 (21.0, 26.2) | 375 | 21.2 (17.6, 23.9) | 25.5 (23.8, 27.4) |
| calcium | 2.11 (2.00, 2.22) | 375 | 1.99 (1.89, 2.10) | 2.20 (2.11, 2.29) |
| Amino-terminal brain natriuretic peptide precursor(NT-proBNP) | 304 (123, 866) | 375 | 921 (304, 3,666) | 163 (29, 304) |
| Lactate dehydrogenase | 274 (202, 598) | 375 | 618 (437, 880) | 204 (177, 245) |
| platelet large cell ratio | 31 (26, 37) | 375 | 35 (31, 42) | 28 (24, 32) |
| Interleukin 6 | 18 (13, 24) | 375 | 18 (18, 70) | 18 (3, 18) |
| Fibrin degradation products | 6 (5, 7) | 375 | 7 (6, 150) | 6 (4, 6) |
| monocytes count | 0.43 (0.32, 0.60) | 375 | 0.40 (0.21, 0.56) | 0.47 (0.37, 0.61) |
| PLT distribution width | 12.50 (11.10, 14.30) | 375 | 13.60 (12.50, 15.95) | 11.70 (10.70, 12.70) |
| globulin | 32.4 (29.1, 35.7) | 375 | 33.7 (31.2, 37.5) | 31.2 (28.2, 33.4) |
| gamma_glutamyl_transpeptidase | 33 (22, 52) | 375 | 38 (27, 72) | 29 (18, 44) |
| International standard ratio | 1.10 (1.02, 1.29) | 375 | 1.30 (1.15, 1.54) | 1.04 (0.98, 1.09) |
| basophil count(#) | 0.020 (0.010, 0.030) | 375 | 0.020 (0.010, 0.030) | 0.020 (0.010, 0.030) |
| 2019-nCoV nucleic acid detection | | 375 | | |
| -1 | 375 (100%) | | 174 (100%) | 201 (100%) |
| mean corpuscular hemoglobin | 30.90 (29.80, 32.20) | 375 | 30.90 (30.00, 32.27) | 30.90 (29.60, 32.10) |
| Activation of partial thromboplastin time | 39 (37, 43) | 375 | 39 (37, 44) | 39 (36, 41) |
| High sensitivity C-reactive protein | 26 (2, 96) | 375 | 97 (57, 177) | 2 (1, 13) |
| HIV antibody quantification | 0.09 (0.08, 0.10) | 375 | 0.09 (0.07, 0.10) | 0.09 (0.08, 0.10) |
| serum sodium | 141 (138, 143) | 375 | 141 (138, 147) | 141 (139, 142) |
| thrombocytocrit | 0.21 (0.15, 0.27) | 375 | 0.16 (0.10, 0.21) | 0.25 (0.21, 0.31) |
| ESR | 28 (16, 40) | 375 | 28 (20, 46) | 28 (14, 36) |
| glutamic-pyruvic transaminase | 26 (17, 41) | 375 | 27 (19, 46) | 25 (16, 37) |
| eGFR | 89 (68, 104) | 375 | 70 (41, 91) | 98 (86, 112) |
| creatinine | 74 (58, 97) | 375 | 88 (66, 148) | 66 (55, 83) |
¹ Statistics presented: Median (IQR); n (%)
With a clean dataset, it is possible to start the analysis.
First, we can check the number of patients of each gender.
gender_bar_plot <- ggplot(patients_df, aes(x = gender, fill = gender)) +
geom_bar() +
theme_bw() +
labs(title = "Number of patients divided into gender",
x = "Gender",
y = "Number of patients",
fill = "Gender")
ggsave(filename = "gender_bar_plot.svg",
plot = gender_bar_plot,
device = "svg",
path = "plots")
ggplotly(gender_bar_plot)
Below is a chart showing the ages of the patients.
gender_age_hist <- ggplot(patients_df, aes(x = age, fill=gender)) +
geom_histogram(binwidth = 5.0) +
theme_bw() +
labs(title = "Histogram of patients age and genders.",
x = "Age",
y = "Number of patients",
fill = "Gender") +
scale_x_continuous(breaks = seq(0, max(patients_df$age), 5))
ggsave(filename = "gender_age_plot.svg",
plot = gender_age_hist,
device = "svg",
path = "plots")
ggplotly(gender_age_hist)
It is also possible to create two histograms, one for each gender.
gender_age_histograms <- ggplot(patients_df, aes(x = age, fill=gender)) +
geom_histogram(binwidth = 5.0) +
theme_bw() +
facet_wrap(~gender) +
labs(title = "Histogram of patients age and genders.",
x = "Age",
y = "Number of patients",
fill = "Gender") +
scale_x_continuous(breaks = seq(0, max(patients_df$age), 5))
ggsave(filename = "gender_age_histograms.svg",
plot = gender_age_histograms,
device = "svg",
path = "plots")
ggplotly(gender_age_histograms)
Let us show histograms of outcome by hospitalization length. First, a histogram including both genders.
hosp_length_hist <- ggplot(patients_df, aes(x = hospitalization_length, fill = gender)) +
geom_histogram(binwidth = 1.0) +
theme_bw() +
scale_x_continuous(breaks=seq(0, max(patients_df$hospitalization_length), 1)) +
labs(title = "Number of patients and their hospitalization length",
x = "Hospitalization length (in days)",
y = "Number of patients",
fill= "Gender")
ggsave(filename = "hospitalization_length_histograms.svg",
plot = hosp_length_hist,
device = "svg",
path = "plots")
ggplotly(hosp_length_hist)
The next chart is a histogram including information about whether the patient survived or died.
hosp_length_outcome_hist <- ggplot(patients_df, aes(x = hospitalization_length,
fill = outcome)) +
geom_histogram(binwidth = 1.0) +
theme_bw() +
scale_x_continuous(breaks = seq(0, max(patients_df$hospitalization_length), 1)) +
labs(title = "Number of patients, their hospitalization length and outcome (survived or died)",
x = "Hospitalization length (in days)",
y = "Number of patients",
fill = "Outcome")
ggsave(filename = "hospitalization_length_outcome_histogram.svg",
plot = hosp_length_outcome_hist,
device = "svg",
path = "plots")
ggplotly(hosp_length_outcome_hist)
Next, histograms including all the information above, but divided by gender and outcome.
hosp_length_outcomes_hists <- ggplot(patients_df, aes(x = hospitalization_length,
fill = outcome)) +
geom_histogram(binwidth = 1.0) +
theme_bw() +
facet_grid(gender~outcome) +
scale_x_continuous(breaks = seq(0, max(patients_df$hospitalization_length), 5)) +
scale_y_continuous(breaks = seq(0, 20, 4)) +
labs(title = "Number of patients, their hospitalization length and outcome (survived or died)",
x = "Hospitalization length (in days)",
y = "Number of patients",
fill = "Outcome")
ggsave(filename = "hospitalization_length_outcomes_histograms.svg",
plot = hosp_length_outcomes_hists,
device = "svg",
path = "plots")
ggplotly(hosp_length_outcomes_hists)
The following charts are animated. The first shows the number of patient deaths on successive days.
patients_death <- patients_df %>%
select(discharge_time, outcome) %>%
filter(outcome == "died") %>%
mutate(discharge_time = as.integer(difftime(discharge_time,
min(patients_df$discharge_time),
units="days"))) %>%
group_by(discharge_time) %>%
summarise(num_of_deaths = n(), .groups="drop")
animated_deaths_plot <- ggplot(patients_death, aes(x = discharge_time,
y = num_of_deaths)) +
geom_line(color = "blue",
size=1.5) +
theme_bw() +
labs(title = "Number of patient deaths on successive days (counted from the first discharge)",
x = "Days since the first discharge",
y = "Number of deaths") +
scale_x_continuous(breaks = seq(0, max(patients_death$discharge_time), 2)) +
scale_y_continuous(breaks = seq(0, max(patients_death$num_of_deaths), 1)) +
transition_reveal(discharge_time)
anim_save("plots/deaths.gif", animated_deaths_plot)
The chart below shows the cumulative number of deaths on successive days.
patients_death <- patients_death %>%
arrange(discharge_time) %>%
mutate(num_of_deaths_agg = cumsum(num_of_deaths))
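The cumulative count produced by cumsum can be illustrated with made-up daily totals:

```r
# Minimal illustration of the cumulative-sum step with invented counts:
daily_deaths <- c(1, 0, 3, 2, 5)   # hypothetical deaths per day
cumulative   <- cumsum(daily_deaths)
cumulative  # 1 1 4 6 11: running total up to each day
```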
animated_deaths_agg_plot <- ggplot(patients_death,
aes(x = discharge_time,
y = num_of_deaths_agg)) +
geom_line(color = "blue",
size = 1.5) +
theme_bw() +
labs(title = "Cumulative number of patient deaths on successive days (counted from the first discharge)",
x = "Days since the first discharge",
y = "Number of deaths (cumulative)") +
scale_x_continuous(breaks = seq(0, max(patients_death$discharge_time), 2)) +
scale_y_continuous(breaks = seq(0, max(patients_death$num_of_deaths_agg), 10)) +
transition_reveal(discharge_time)
anim_save("plots/deaths_agg.gif", animated_deaths_agg_plot)
In the following chapter, an attempt was made to create a model that will determine the probability of death from COVID-19 based on the observations and the results of the blood tests.
An assumption was made that each row is a separate observation of a unique patient. First, the right columns should be extracted from the dataset. The column containing the patients' IDs is not needed, just like the columns with the dates of testing and the admission/discharge times (we do not know in advance how long a patient will be hospitalized). We also need to shuffle the rows of the data frame: the first patient who died is at position 202, and we cannot start training models before both classes occur in the data. To ensure repeatability of the shuffling, a seed is set.
ml_df <- data_df %>%
select(-c("PATIENT_ID", "RE_DATE", "discharge_time","admission_time"))
set.seed(17)
rows <- sample(nrow(ml_df))
ml_df <- ml_df[rows,]
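A quick base-R check (at toy size) that a fixed seed makes the shuffle reproducible:

```r
# With the same seed, sample() yields the same permutation every run,
# so the shuffled row order is identical across executions.
set.seed(17)
first_order <- sample(10)
set.seed(17)
second_order <- sample(10)
identical(first_order, second_order)  # TRUE
```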
After preparing the dataset, it is possible to split the data into two sets: a training set (70% of the observations) and a testing set (the remaining 30%).
Repeated cross-validation was chosen as the learning procedure. This means the learning process is performed several times (the repeats argument of the trainControl function). Each time, the training data is split into a training part and a validation part: the model is fit on the training part and then validated on the validation part.
inTraining <- createDataPartition(y = ml_df$outcome, p=.70, list=FALSE)
training <- ml_df[inTraining,]
testing <- ml_df[-inTraining,]
ctrl <- trainControl(method="repeatedcv",
number = 2,
repeats = 5,
classProbs = TRUE)
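What repeated cross-validation does can be sketched in base R: each repeat randomly partitions the rows into number folds (2 here), and every fold serves once as the validation part. A toy sketch on 20 hypothetical row indices:

```r
# Sketch of repeated CV fold construction: 2 folds, repeated 5 times.
n <- 20  # hypothetical number of training rows
folds_per_repeat <- lapply(1:5, function(r) {
  split(sample(n), rep(1:2, length.out = n))  # random 2-fold partition
})
# within one repeat, the folds are disjoint and together cover all rows
lengths(folds_per_repeat[[1]])  # both folds have 10 rows
```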
The following subchapters examine several machine learning methods and the differences between them in accuracy, performance and errors.
The Nearest Neighbours classifier belongs to a group of classifiers based on instance methods. The classification is performed online, when a new case needs to be classified; such methods are generally called lazy learning methods. The idea behind the k-NN algorithm is as follows: when a new object X needs to be classified, the k-NN classifier searches the pattern space for the k objects of the training set closest to X. Using majority voting, X is assigned to the class that dominates among its k nearest neighbours.
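The voting rule described above can be sketched in base R (Euclidean distance; the two-dimensional training points below are made up for illustration):

```r
# Minimal k-NN sketch: find the k training points closest to x,
# then assign the majority class among them.
knn_predict <- function(train_x, train_y, x, k = 3) {
  dists <- sqrt(rowSums((train_x - matrix(x, nrow(train_x), length(x),
                                          byrow = TRUE))^2))
  nearest <- order(dists)[1:k]          # indices of the k closest points
  votes <- table(train_y[nearest])      # count class labels among them
  names(votes)[which.max(votes)]        # majority vote
}
train_x <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6), c(6, 5))
train_y <- c("survived", "survived", "survived", "died", "died", "died")
knn_predict(train_x, train_y, c(0.5, 0.5), k = 3)  # "survived"
```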
At the beginning, the seed is set; this ensures the repeatability of the experiments.
set.seed(17)
Then it is possible to start the learning process with the k-NN method. The parameter k is set to 10.
tGrid <- expand.grid(k = 10)
knn_fit <- train(outcome ~ .,
data = training,
method = "knn",
trControl = ctrl,
tuneGrid = tGrid)
After the training process, the model needs to be tested against the testing set. Two types of results can be obtained: the predicted class labels and the predicted class probabilities.
In the following lines, both types of results are obtained.
knnClasses_prob <- predict(knn_fit,
newdata = testing,
type = "prob")
knnClasses <- predict(knn_fit,
newdata = testing)
With the classified samples, it is possible to create a confusion matrix. The confusion matrix is an m x m matrix, where m is the number of class labels; in the caret output below, the rows correspond to the class labels assigned by the classifier and the columns to the actual class labels of the test records. The caret library provides a function that returns the confusion matrix together with basic performance statistics.
knn_confMatrix <- caret::confusionMatrix(data = knnClasses, testing$outcome)
knn_confMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction died survived
## died 48 7
## survived 4 53
##
## Accuracy : 0.9018
## 95% CI : (0.8311, 0.9499)
## No Information Rate : 0.5357
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8033
##
## Mcnemar's Test P-Value : 0.5465
##
## Sensitivity : 0.9231
## Specificity : 0.8833
## Pos Pred Value : 0.8727
## Neg Pred Value : 0.9298
## Prevalence : 0.4643
## Detection Rate : 0.4286
## Detection Prevalence : 0.4911
## Balanced Accuracy : 0.9032
##
## 'Positive' Class : died
##
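These headline statistics can be reproduced by hand from the 2x2 table above (base R):

```r
# Hand computation of a few statistics from the k-NN confusion matrix:
tp <- 48; fp <- 7   # predicted "died":     48 correct, 7 wrong
fn <- 4;  tn <- 53  # predicted "survived": 53 correct, 4 wrong
accuracy    <- (tp + tn) / (tp + fp + fn + tn)
sensitivity <- tp / (tp + fn)   # true "died" recognized as "died"
specificity <- tn / (tn + fp)   # true "survived" recognized as such
round(c(accuracy, sensitivity, specificity), 4)  # 0.9018 0.9231 0.8833
```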
You can see the basic performance statistics. The model is not perfect, but its predictions are quite good. Confusion matrix analysis is one of many ways to assess the accuracy of a model; another is the ROC curve and the area under it (AUC). The ROC curve is a performance measure for classification problems at various threshold settings; the AUC represents the degree of separability, i.e. how capable the model is of distinguishing between the classes. The higher the AUC, the better the model is at predicting survivors as survived and the deceased as died, or, by analogy, at distinguishing between patients with and without the disease.
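One way to see AUC concretely: it equals the probability that a randomly chosen "died" case receives a higher predicted probability than a randomly chosen "survived" case (the Mann-Whitney formulation). A base-R sketch on made-up scores:

```r
# AUC as the probability of correct pairwise ranking (ties count 0.5).
auc <- function(prob_positive, is_positive) {
  pos <- prob_positive[is_positive]
  neg <- prob_positive[!is_positive]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}
p <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.2)            # invented P(died) scores
y <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)   # invented true labels
auc(p, y)  # 8/9: most positives are ranked above most negatives
```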
knn_ROC <- roc(response = testing$outcome,
predictor = knnClasses_prob[, "died"],
levels = rev(levels(testing$outcome)))
plot(knn_ROC,
print.auc = TRUE,
col="black",
lwd = 5)
legend("bottomright",
col=c(par("fg"), "blue"),
legend = c("Empirical"),
lwd=5)
The AUC value is quite high. Now we need to check whether overfitting occurred in our model.
Overfitting, an over-matching to the specifics of the training data, is most often associated with a loss of generalizability. To check whether the model is overfitted, a special function was created to generate learning curves (similar functions were created for the other tested methods).
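The curves in this section are built on root mean square error; the rmse helper called in the function below is presumably supplied by an error-metrics package, and its definition amounts to the following (assuming the outcomes are coded numerically):

```r
# Minimal equivalent of the rmse metric used by the learning curves.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}
rmse(c(1, 2, 1, 2), c(1, 2, 2, 2))  # one miss out of four -> 0.5
```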
my_learning_curve_knn <- function(training_data,
test_data,
outcome = NULL,
trControl = NULL)
{
train_len <- nrow(training_data)
if(train_len < 2)
stop("Training data must have at least 2 data points.")
if(nrow(test_data) < 1)
stop("Test data must have at least 1 data point.")
if(is.null(outcome))
stop("Please give a character string for the outcome column name.")
if(is.null(trControl))
stop("Please put the train controller.")
learnCurve <- data.frame(m = integer(train_len),
RMSE_type = character(train_len),
RMSE_value = integer(train_len))
j <- 0
tGrid <- expand.grid(k = 10)
for (i in c(seq(5,train_len, 20), train_len))
{
j <- j + 1
tr <- training_data[1:i,]
set.seed(17)
trainModel <- train(outcome ~ .,
data = tr,
method = "knn",
trControl = trControl,
tuneGrid = tGrid)
learnCurve$m[j] <- i
learnCurve$RMSE_type[j] <- "Training Error"
learnCurve$RMSE_value[j] <- rmse(tr$outcome,
predict(trainModel,
newdata = tr))
j <- j + 1
learnCurve$m[j] <- i
learnCurve$RMSE_type[j] <- "Testing Error"
learnCurve$RMSE_value[j] <- rmse(test_data$outcome,
predict(trainModel,
newdata = test_data))
}
learnCurve[1:j,]
}
This function generates learning curves for the training and testing sets of the model. The chart below shows both curves.
knn_lmc <- my_learning_curve_knn(training_data = training,
test_data = testing,
outcome = "outcome",
trControl = ctrl)
knn_learning_curves <- ggplot(knn_lmc, aes(x = m)) +
geom_line(aes(y = RMSE_value, color = RMSE_type), size=2.0) +
theme_bw() +
labs(title = "Learning Curves",
x = "Examples",
y = "RMSE value",
color = "Learning Curve")
ggsave(filename = "k_nn_learning_curves.svg",
plot = knn_learning_curves,
device = "svg",
path = "plots")
ggplotly(knn_learning_curves)
On the graph you can see the learning curves: the X axis shows the number of examples included in the training process, and the Y axis the value of the root mean square error. The error on the training set is similar to the error on the testing set, and neither is too big, so we can assume the model is not overfitted.
SVM is a classification method whose learning aim is to determine the separating hyperplane with the maximum margin between examples belonging to the two classes.
At the beginning, the seed is set; this ensures the repeatability of the experiments.
set.seed(17)
Then it is possible to start the learning process with the SVM method. We choose the svmRadial method: the kernel used in training and prediction is based on radial basis functions.
svm_fit <- train(outcome ~ .,
data = training,
method = "svmRadial",
trControl = ctrl)
As with the k-NN method, both the classified samples and the class probabilities are predicted.
svmClasses <- predict(svm_fit,
newdata = testing)
svmClasses_prob <- predict(svm_fit,
newdata = testing,
type = "prob")
Below is the confusion matrix for the SVM model.
# predictions first, then the reference labels (same order as for k-NN)
svm_confMatrix <- caret::confusionMatrix(data = svmClasses,
reference = testing$outcome)
svm_confMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction died survived
## died 49 3
## survived 4 56
##
## Accuracy : 0.9375
## 95% CI : (0.8755, 0.9745)
## No Information Rate : 0.5268
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8745
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9245
## Specificity : 0.9492
## Pos Pred Value : 0.9423
## Neg Pred Value : 0.9333
## Prevalence : 0.4732
## Detection Rate : 0.4375
## Detection Prevalence : 0.4643
## Balanced Accuracy : 0.9368
##
## 'Positive' Class : died
##
This model made fewer errors than the k-NN model. Below is the ROC curve for the SVM model.
svm_ROC <- roc(response = testing$outcome,
predictor = svmClasses_prob[, "died"],
levels = rev(levels(testing$outcome)))
plot(svm_ROC,
print.auc = TRUE,
col="black",
lwd = 5)
legend("bottomright",
col=c(par("fg"), "blue"),
legend = c("Empirical"),
lwd = 5)
The AUC is high, as in the k-NN model. Below is the function creating learning curves for the SVM method.
my_learning_curve_svm <- function(training_data,
test_data,
outcome = NULL,
trControl = NULL)
{
train_len <- nrow(training_data)
if(train_len < 2)
stop("Training data must have at least 2 data points.")
if(nrow(test_data) < 1)
stop("Test data must have at least 1 data point.")
if(is.null(outcome))
stop("Please give a character string for the outcome column name.")
if(is.null(trControl))
stop("Please put the train controller.")
learnCurve <- data.frame(m = integer(train_len),
RMSE_type = character(train_len),
RMSE_value = integer(train_len))
j <- 0
# no tuning grid is needed here; svmRadial tunes its parameters internally
for (i in c(seq(20, train_len, 20), train_len))
{
j <- j + 1
tr <- training_data[1:i,]
set.seed(17)
trainModel <- train(outcome ~ .,
data = tr,
method = "svmRadial",
trControl = trControl)
learnCurve$m[j] <- i
learnCurve$RMSE_type[j] <- "Training Error"
learnCurve$RMSE_value[j] <- rmse(tr$outcome,
predict(trainModel,
newdata = tr))
j <- j + 1
learnCurve$m[j] <- i
learnCurve$RMSE_type[j] <- "Testing Error"
learnCurve$RMSE_value[j] <- rmse(test_data$outcome,
predict(trainModel,
newdata = test_data))
}
learnCurve[1:j,]
}
This function generates learning curves for the training and testing sets of the SVM model. The chart below shows both curves.
svm_lmc <- my_learning_curve_svm(training_data = training,
test_data = testing,
outcome = "outcome",
trControl = ctrl)
svm_learning_curves <- ggplot(svm_lmc, aes(x = m)) +
geom_line(aes(y = RMSE_value, color = RMSE_type), size=2.0) +
theme_bw() +
labs(title = "Learning Curves",
x = "Examples",
y = "RMSE value",
color = "Learning Curve")
ggsave(filename = "svm_learning_curves.svg",
plot = svm_learning_curves,
device = "svg",
path = "plots")
ggplotly(svm_learning_curves)
The error for the training set is lower than the error for the testing set. You can see that both errors are lower than the errors of the k-NN method, which suggests the SVM model is better than the k-NN one.
Random Forests belong to the category of methods that construct complex classifiers by modifying the attribute set of the training data. The Random Forest method uses only decision trees as base classifiers.
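The ensemble idea can be illustrated with a toy majority vote (the individual tree votes below are made up for illustration):

```r
# Toy sketch of the ensemble idea: each tree votes, and the forest
# predicts the class that receives the most votes.
majority_vote <- function(votes) {
  counts <- table(votes)
  names(counts)[which.max(counts)]
}
tree_votes <- c("died", "survived", "died", "died", "survived",
                "died", "died", "survived", "died", "died")  # 10 trees
majority_vote(tree_votes)  # "died" (7 of 10 votes)
```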
At the beginning, the seed is set; this ensures the repeatability of the experiments.
set.seed(17)
Then it is possible to start the learning process with the Random Forests method. The number of trees in the forest is set to 10.
rf_fit <- train(outcome ~ .,
data = training,
method = "rf",
trControl = ctrl,
ntree=10)
As previously, both the classified samples and the class probabilities are predicted.
rfClasses_prob <- predict(rf_fit,
newdata = testing,
type = "prob")
rfClasses <- predict(rf_fit,
newdata = testing)
importance_vector <- as.vector(rf_fit$finalModel$importance)
Below is the confusion matrix for the Random Forests model.
rf_confMatrix <- caret::confusionMatrix(data = rfClasses, testing$outcome)
rf_confMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction died survived
## died 51 2
## survived 1 58
##
## Accuracy : 0.9732
## 95% CI : (0.9237, 0.9944)
## No Information Rate : 0.5357
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9462
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9808
## Specificity : 0.9667
## Pos Pred Value : 0.9623
## Neg Pred Value : 0.9831
## Prevalence : 0.4643
## Detection Rate : 0.4554
## Detection Prevalence : 0.4732
## Balanced Accuracy : 0.9737
##
## 'Positive' Class : died
##
This model made the fewest errors compared with the previous models. Below is the ROC curve for the Random Forests model.
rf_ROC <- roc(response = testing$outcome,
predictor = rfClasses_prob[, "died"],
levels = rev(levels(testing$outcome)))
plot(rf_ROC,
print.auc = TRUE,
col="black",
lwd = 5)
legend("bottomright",
col=c(par("fg"), "blue"),
legend = c("Empirical"),
lwd = 5)
The AUC is, as in the previous models, high. Below is the function creating learning curves.
my_learning_curve_rf <- function(training_data,
test_data,
outcome = NULL,
trControl = NULL)
{
train_len <- nrow(training_data)
if(train_len < 2)
stop("Training data must have at least 2 data points.")
if(nrow(test_data) < 1)
stop("Test data must have at least 1 data point.")
if(is.null(outcome))
stop("Please give a character string for the outcome column name.")
if(is.null(trControl))
stop("Please put the train controller.")
learnCurve <- data.frame(m = integer(train_len),
RMSE_type = character(train_len),
RMSE_value = numeric(train_len),
stringsAsFactors = FALSE)
j <- 0
for (i in c(seq(5, train_len, 20), train_len))
{
j <- j + 1
tr <- training_data[1:i,]
set.seed(17)
trainModel <- train(outcome ~ .,
data = tr,
method = "rf",
trControl = trControl,
ntree = 10)
learnCurve$m[j] <- i
learnCurve$RMSE_type[j] <- "Training Error"
learnCurve$RMSE_value[j] <- rmse(tr$outcome,
predict(trainModel,
newdata = tr))
j <- j + 1
learnCurve$m[j] <- i
learnCurve$RMSE_type[j] <- "Testing Error"
learnCurve$RMSE_value[j] <- rmse(test_data$outcome,
predict(trainModel,
newdata = test_data))
}
learnCurve[1:j,]
}
This function generates learning curves for the training and test sets of the Random Forests model. The chart below shows both curves.
rf_mlc <- my_learning_curve_rf(training_data = training,
test_data = testing,
outcome = "outcome",
trControl = ctrl)
rf_learning_curves <- ggplot(rf_mlc, aes(x = m)) +
geom_line(aes(y = RMSE_value, color = RMSE_type), size=2.0) +
theme_bw() +
labs(title = "Learning Curves",
x = "Examples",
y = "RMSE value",
color = "Learning Curve")
ggsave(filename = "rf_learning_curves.svg",
plot = rf_learning_curves,
device = "svg",
path = "plots")
ggplotly(rf_learning_curves)
The training error is lower than the testing error, and both errors are quite low.
We have built three prediction models using three different learning methods. Below is a comparison of them.
First, we can compare the confusion matrices. Below is the confusion matrix for the k-NN method.
knn_confMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction died survived
## died 48 7
## survived 4 53
##
## Accuracy : 0.9018
## 95% CI : (0.8311, 0.9499)
## No Information Rate : 0.5357
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8033
##
## Mcnemar's Test P-Value : 0.5465
##
## Sensitivity : 0.9231
## Specificity : 0.8833
## Pos Pred Value : 0.8727
## Neg Pred Value : 0.9298
## Prevalence : 0.4643
## Detection Rate : 0.4286
## Detection Prevalence : 0.4911
## Balanced Accuracy : 0.9032
##
## 'Positive' Class : died
##
Below is the confusion matrix for the SVM method.
svm_confMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction died survived
## died 49 3
## survived 4 56
##
## Accuracy : 0.9375
## 95% CI : (0.8755, 0.9745)
## No Information Rate : 0.5268
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8745
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9245
## Specificity : 0.9492
## Pos Pred Value : 0.9423
## Neg Pred Value : 0.9333
## Prevalence : 0.4732
## Detection Rate : 0.4375
## Detection Prevalence : 0.4643
## Balanced Accuracy : 0.9368
##
## 'Positive' Class : died
##
Below is the confusion matrix for the Random Forests method.
rf_confMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction died survived
## died 51 2
## survived 1 58
##
## Accuracy : 0.9732
## 95% CI : (0.9237, 0.9944)
## No Information Rate : 0.5357
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9462
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9808
## Specificity : 0.9667
## Pos Pred Value : 0.9623
## Neg Pred Value : 0.9831
## Prevalence : 0.4643
## Detection Rate : 0.4554
## Detection Prevalence : 0.4732
## Balanced Accuracy : 0.9737
##
## 'Positive' Class : died
##
The Random Forests model has the best confusion matrix: it made the fewest mistakes.
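The headline statistics of the three matrices can also be collected into a single table for easier comparison. A minimal sketch, assuming the knn_confMatrix, svm_confMatrix and rf_confMatrix objects created above:

```r
# caret's confusionMatrix objects expose their statistics via $overall
# and $byClass, so the key metrics can be pulled into one data frame.
cm_list <- list("k-NN"           = knn_confMatrix,
                "SVM"            = svm_confMatrix,
                "Random Forests" = rf_confMatrix)

model_comparison <- data.frame(
  Method      = names(cm_list),
  Accuracy    = sapply(cm_list, function(cm) cm$overall["Accuracy"]),
  Kappa       = sapply(cm_list, function(cm) cm$overall["Kappa"]),
  Sensitivity = sapply(cm_list, function(cm) cm$byClass["Sensitivity"]),
  Specificity = sapply(cm_list, function(cm) cm$byClass["Specificity"]),
  row.names   = NULL
)

kable(model_comparison)
```

Sorting this table by Accuracy or Kappa makes the ranking of the three models explicit.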
Let’s check the ROC curves. Below is a chart with the ROC curves of all three models.
plot(knn_ROC,
col="red",
lwd = 5)
plot(svm_ROC,
col = "green",
add = TRUE,
lwd = 5)
plot(rf_ROC,
col = "blue",
add = TRUE,
lwd = 5)
legend("bottomright",
col=c("red", "green", "blue"),
legend = c("k-NN", "SVM", "Random Forests"),
lwd = 5)
The chart above shows that the ROC curve of the Random Forests method is the most convex.
Let’s compare the learning curves. First, we need to prepare data sets with the learning results.
knn_train_lmc <- knn_lmc %>%
filter(RMSE_type == "Training Error") %>%
mutate(method = "k-NN")
knn_test_lmc <- knn_lmc %>%
filter(RMSE_type == "Testing Error") %>%
mutate(method = "k-NN")
svm_train_lmc <- svm_lmc %>%
filter(RMSE_type == "Training Error") %>%
mutate(method = "SVM")
svm_test_lmc <- svm_lmc %>%
filter(RMSE_type == "Testing Error") %>%
mutate(method = "SVM")
rf_train_lmc <- rf_mlc %>%
filter(RMSE_type == "Training Error") %>%
mutate(method = "Random Forests")
rf_test_lmc <- rf_mlc %>%
filter(RMSE_type == "Testing Error") %>%
mutate(method = "Random Forests")
training_curves <- rbind(knn_train_lmc, svm_train_lmc, rf_train_lmc)
testing_curves <- rbind(knn_test_lmc, svm_test_lmc, rf_test_lmc)
After that, it is possible to create the plots and compare them. First, we can compare the learning curves for the training set.
comp_training_curves <- ggplot(training_curves, aes(x = m)) +
geom_line(aes(y = RMSE_value, color = method), size=2.0) +
theme_bw() +
labs(title = "Learning Curves for training",
x = "Examples",
y = "RMSE value",
color = "Method of learning")
ggsave(filename = "comparison_training_curves.svg",
plot = comp_training_curves,
device = "svg",
path = "plots")
ggplotly(comp_training_curves)
The smallest training error is achieved by the Random Forests model. Below is the comparison of learning curves for the testing process.
comp_testing_curves <- ggplot(testing_curves, aes(x = m)) +
geom_line(aes(y = RMSE_value, color = method), size=2.0) +
theme_bw() +
labs(title = "Learning Curves for testing",
x = "Examples",
y = "RMSE value",
color = "Method of learning")
ggsave(filename = "comparison_testing_curves.svg",
plot = comp_testing_curves,
device = "svg",
path = "plots")
ggplotly(comp_testing_curves)
As shown above, the lowest error is achieved by the Random Forests model, so we can consider it the best prediction model. In the following chapter, an attempt will be made to improve this model by tuning.
To improve the model, we can reduce the number of attributes used in training. Let's check which attributes are the most important in our best model. The importance of an attribute can be judged by its mean decrease in the Gini index. Below is a table of attributes sorted by importance for the best model.
attributes <- colnames(select(ml_df, -"outcome"))
importance_vector <- as.vector(rf_fit$finalModel$importance)
importance_df <- data.frame(matrix(c(attributes, importance_vector), ncol = 2))
colnames(importance_df) <- c("Attribute", "Importance")
importance_df <- importance_df %>%
arrange(desc(as.numeric(Importance)))
kable(importance_df)
| Attribute | Importance |
|---|---|
| High sensitivity C-reactive protein | 44.1607344234113 |
| Lactate dehydrogenase | 18.0836587727037 |
| D-D dimer | 14.7009606239862 |
| neutrophils(%) | 13.3725297874191 |
| International standard ratio | 7.92895531755091 |
| (%)lymphocyte | 7.80154653278904 |
| Platelet count | 5.82501188186179 |
| age | 4.69779397159593 |
| procalcitonin | 3.59753869309935 |
| eosinophils(%) | 1.85726499671876 |
| thrombocytocrit | 1.61405463249789 |
| aspartate aminotransferase | 0.988070175438596 |
| lymphocyte count | 0.772881355932203 |
| monocytes(%) | 0.618302277432711 |
| ferritin | 0.494117647058824 |
| Serum chloride | 0.46 |
| fibrinogen | 0.4 |
| Activation of partial thromboplastin time | 0.392 |
| Eosinophil count | 0.372649572649573 |
| Tumor_necrosis_factor_alfa | 0.27 |
| indirect bilirubin | 0.198373983739837 |
| serum sodium | 0.198373983739837 |
| basophil count(#) | 0.1875 |
| Red blood cell distribution width | 0.18 |
| hematocrit | 0.18 |
| PLT distribution width | 0.171428571428571 |
| glucose | 0.166666666666667 |
| mean corpuscular hemoglobin | 0.166666666666667 |
| Red blood cell count | 0.1 |
| gender | 0.0962315462315462 |
| Total cholesterol | 0.0346320346320347 |
| glutamic-pyruvic transaminase | 0.0309597523219814 |
| PH value | 0.0272727272727272 |
| Hypersensitive cardiac troponinI | 0 |
| hemoglobin | 0 |
| Prothrombin time | 0 |
| Interleukin 2 receptor | 0 |
| Alkaline phosphatase | 0 |
| albumin | 0 |
| basophil(%) | 0 |
| Interleukin 10 | 0 |
| Total bilirubin | 0 |
| antithrombin | 0 |
| Interleukin 8 | 0 |
| total protein | 0 |
| Quantification of Treponema pallidum antibodies | 0 |
| Prothrombin activity | 0 |
| HBsAg | 0 |
| mean corpuscular volume | 0 |
| White blood cell count | 0 |
| mean corpuscular hemoglobin concentration | 0 |
| Interleukin_1_Beta | 0 |
| Urea | 0 |
| Corrected calcium | 0 |
| Serum potassium | 0 |
| neutrophils count | 0 |
| Direct bilirubin | 0 |
| Mean platelet volume | 0 |
| RBC distribution width SD | 0 |
| Thrombin time | 0 |
| HCV antibody quantification | 0 |
| Uric acid | 0 |
| HCO3- | 0 |
| calcium | 0 |
| Amino-terminal brain natriuretic peptide precursor(NT-proBNP) | 0 |
| platelet large cell ratio | 0 |
| Interleukin 6 | 0 |
| Fibrin degradation products | 0 |
| monocytes count | 0 |
| globulin | 0 |
| gamma_glutamyl_transpeptidase | 0 |
| 2019-nCoV nucleic acid detection | 0 |
| HIV antibody quantification | 0 |
| ESR | 0 |
| eGFR | 0 |
| creatinine | 0 |
An attempt was made to limit the attribute set to those whose importance is greater than 0. These most important attributes were selected from the list above.
most_important_attr_rf <- importance_df %>%
filter(as.numeric(Importance) > 0)
most_important_attr_rf <- most_important_attr_rf$Attribute
new_training <- training %>%
select(c("outcome", all_of(most_important_attr_rf)))
new_testing <- testing %>%
select(c("outcome", all_of(most_important_attr_rf)))
The same training process as above is now repeated.
First, the seed must be set.
set.seed(17)
Now, the model is trained on the new training data.
new_rf_fit <- train(outcome ~ .,
data = new_training,
method = "rf",
trControl = ctrl,
ntree=10)
With the tuned model, we can predict the classes. As before, both types of prediction will be generated.
new_rfClasses_prob <- predict(new_rf_fit,
newdata = new_testing,
type = "prob")
new_rfClasses <- predict(new_rf_fit,
newdata = new_testing)
Below is the confusion matrix for the tuned model.
new_rf_confMatrix <- caret::confusionMatrix(data = new_rfClasses, new_testing$outcome)
new_rf_confMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction died survived
## died 52 4
## survived 0 56
##
## Accuracy : 0.9643
## 95% CI : (0.9111, 0.9902)
## No Information Rate : 0.5357
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9286
##
## Mcnemar's Test P-Value : 0.1336
##
## Sensitivity : 1.0000
## Specificity : 0.9333
## Pos Pred Value : 0.9286
## Neg Pred Value : 1.0000
## Prevalence : 0.4643
## Detection Rate : 0.4643
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.9667
##
## 'Positive' Class : died
##
You can see that the tuned model made slightly more mistakes than the original model, but it produced no false negatives: no patient who died was classified as a survivor (sensitivity is 1.0000). Its errors are all false positives, i.e. survivors classified as likely to die. Given the choice, this type of error is far more acceptable than a false negative, which would classify a patient who dies as a survivor.
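The two error types can be read directly off the confusion-matrix tables. A minimal sketch, assuming the rf_confMatrix and new_rf_confMatrix objects from above (rows are predictions, columns are the reference, and "died" is the positive class):

```r
# A false negative is a patient who died but was predicted to survive;
# a false positive is a survivor who was predicted to die.
error_counts <- function(cm) {
  c(false_negatives = cm$table["survived", "died"],
    false_positives = cm$table["died", "survived"])
}

# Per the matrices printed above: original 1 FN / 2 FP, tuned 0 FN / 4 FP.
rbind(original = error_counts(rf_confMatrix),
      tuned    = error_counts(new_rf_confMatrix))
```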
Below is the ROC curve for the tuned model.
new_rf_ROC <- roc(response = new_testing$outcome,
predictor = new_rfClasses_prob[, "died"],
levels = rev(levels(new_testing$outcome)))
plot(new_rf_ROC,
print.auc = TRUE,
col="black",
lwd = 5)
legend("bottomright",
col=c(par("fg"), "blue"),
legend = c("Empirical"),
lwd = 5)
The area under the curve (AUC) is larger than in the original model, which means the tuned model is more efficient than the original.
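The AUC values printed on the two plots can also be compared directly. A minimal sketch, assuming the rf_ROC and new_rf_ROC objects created with pROC's roc():

```r
library(pROC)

# auc() extracts the area under the curve from a roc object; coercing
# to numeric gives a plain number that is easy to tabulate.
data.frame(model = c("original", "tuned"),
           AUC   = c(as.numeric(auc(rf_ROC)),
                     as.numeric(auc(new_rf_ROC))))
```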
Let's check the learning curves for the tuned model. The function created earlier for Random Forests will be reused.
new_rf_mlc <- my_learning_curve_rf(training_data = new_training,
test_data = new_testing,
outcome = "outcome",
trControl = ctrl)
tuned_model_learning_curves_1 <- ggplot(new_rf_mlc, aes(x = m)) +
geom_line(aes(y = RMSE_value, color = RMSE_type), size=2.0) +
theme_bw() +
labs(title = "Learning Curves",
x = "Examples",
y = "RMSE value",
color = "Learning Curve")
ggsave(filename = "tuned_model_learning_curves_1.svg",
plot = tuned_model_learning_curves_1,
device = "svg",
path = "plots")
ggplotly(tuned_model_learning_curves_1)
The learning curves above do not differ much from the original. The testing error stays at the same level; only the training error has increased, and it remains acceptable. Let's check whether the model can be tuned further. The whole process must be repeated.
First, the most important attributes are extracted again.
tuned_attr <- colnames(select(new_training, -"outcome"))
importance_vector2 <- as.vector(new_rf_fit$finalModel$importance)
importance_df2 <- data.frame(matrix(c(tuned_attr, importance_vector2), ncol = 2))
colnames(importance_df2) <- c("Attribute", "Importance")
importance_df2 <- importance_df2 %>%
arrange(desc(as.numeric(Importance)))
kable(importance_df2)
| Attribute | Importance |
|---|---|
| Lactate dehydrogenase | 34.7140969637687 |
| lymphocyte count | 11.9829323741753 |
| D-D dimer | 11.5085680462158 |
| High sensitivity C-reactive protein | 7.03095580994492 |
| monocytes(%) | 6.56151731684465 |
| thrombocytocrit | 6.36876513255838 |
| Serum chloride | 5.7184954140937 |
| PLT distribution width | 5.43850297906969 |
| eosinophils(%) | 4.99905346238046 |
| aspartate aminotransferase | 4.72104091184587 |
| neutrophils(%) | 3.02638249371786 |
| glucose | 2.78834686726925 |
| age | 2.7695633118024 |
| Red blood cell distribution width | 2.31061815531321 |
| (%)lymphocyte | 1.66170233663501 |
| gender | 1.56580213903743 |
| ferritin | 1.56060039381754 |
| Total cholesterol | 1.55755526790951 |
| procalcitonin | 1.41184022610888 |
| Tumor_necrosis_factor_alfa | 1.39327698630936 |
| Activation of partial thromboplastin time | 1.37429218400686 |
| International standard ratio | 1.23499571887807 |
| PH value | 1.10586534002125 |
| serum sodium | 0.839661089661089 |
| hematocrit | 0.814999140300108 |
| Platelet count | 0.7990459931305 |
| Eosinophil count | 0.77741190520095 |
| fibrinogen | 0.568508641600968 |
| indirect bilirubin | 0.375051247771836 |
| mean corpuscular hemoglobin | 0.301658085697764 |
| Red blood cell count | 0.231741911560211 |
| basophil count(#) | 0.211228991596638 |
| glutamic-pyruvic transaminase | 0.198854256854258 |
Let’s try to train the model on the attributes whose importance is higher than 5. There are only 8 such attributes.
most_important_attr_rf2 <- importance_df2 %>%
filter(as.numeric(Importance) > 5) %>%
select(Attribute)
kable(most_important_attr_rf2)
| Attribute |
|---|
| Lactate dehydrogenase |
| lymphocyte count |
| D-D dimer |
| High sensitivity C-reactive protein |
| monocytes(%) |
| thrombocytocrit |
| Serum chloride |
| PLT distribution width |
Now the training and test sets are rebuilt using only the selected attributes.
most_important_attr_rf2 <- most_important_attr_rf2$Attribute
new_training2 <- training %>%
select(c("outcome", all_of(most_important_attr_rf2)))
new_testing2 <- testing %>%
select(c("outcome", all_of(most_important_attr_rf2)))
Then, the model can be trained.
set.seed(17)
new_rf_fit2 <- train(outcome ~ .,
data = new_training2,
method = "rf",
trControl = ctrl,
ntree = 10)
With the trained model, it is possible to predict the results.
new_rfClasses2 <- predict(new_rf_fit2,
newdata = new_testing2)
new_rfClasses2_prob <- predict(new_rf_fit2,
newdata = new_testing2,
type = "prob")
Below is the confusion matrix for the tuned model.
new_rf_confMatrix2 <- caret::confusionMatrix(data = new_rfClasses2,
new_testing2$outcome)
new_rf_confMatrix2
## Confusion Matrix and Statistics
##
## Reference
## Prediction died survived
## died 52 4
## survived 0 56
##
## Accuracy : 0.9643
## 95% CI : (0.9111, 0.9902)
## No Information Rate : 0.5357
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9286
##
## Mcnemar's Test P-Value : 0.1336
##
## Sensitivity : 1.0000
## Specificity : 0.9333
## Pos Pred Value : 0.9286
## Neg Pred Value : 1.0000
## Prevalence : 0.4643
## Detection Rate : 0.4643
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.9667
##
## 'Positive' Class : died
##
There are no differences between this model and the first tuned model. Let's check the ROC curve and the AUC value.
new_rf_ROC2 <- roc(response = new_testing2$outcome,
predictor = new_rfClasses2_prob[, "died"],
levels = rev(levels(new_testing2$outcome)))
plot(new_rf_ROC2,
print.auc = TRUE,
col="black",
lwd = 5)
legend("bottomright",
col=c(par("fg"), "blue"),
legend = c("Empirical"),
lwd = 5)
The graph above shows that this model's AUC is slightly lower than the previous one's. However, we have reduced the number of attributes from 33 to 8: instead of knowing the values of 33 attributes, we need only 8 to predict the outcome with the same accuracy. Below is the chart with the learning curves of the new tuned model.
new_rf_mlc2 <- my_learning_curve_rf(training_data = new_training2,
test_data = new_testing2,
outcome = "outcome",
trControl = ctrl)
tuned_model_learning_curves_2 <- ggplot(new_rf_mlc2, aes(x = m)) +
geom_line(aes(y = RMSE_value, color = RMSE_type), size=2.0) +
theme_bw() +
labs(title = "Learning Curves",
x = "Examples",
y = "RMSE value",
color = "Learning Curve")
ggsave(filename = "tuned_model_learning_curves_2.svg",
plot = tuned_model_learning_curves_2,
device = "svg",
path = "plots")
ggplotly(tuned_model_learning_curves_2)
The chart above shows that the training and testing errors are at the same level, and both values have decreased. This means the next iteration of tuning gave us a better model.
This document presents an analysis of COVID-19 cases. After a short introduction, the data used in the report are presented graphically. Then, machine learning models were built to predict patient mortality. Such a model could be used, for example, in hospitals to judge which patients need to be hospitalized immediately. We extracted 8 attributes that suffice to predict patient mortality with high efficiency. Below is the table with these attributes:
results <- data.frame(most_important_attr_rf2)
colnames(results) <- "Most important attributes"
kable(results)
| Most important attributes |
|---|
| Lactate dehydrogenase |
| lymphocyte count |
| D-D dimer |
| High sensitivity C-reactive protein |
| monocytes(%) |
| thrombocytocrit |
| Serum chloride |
| PLT distribution width |
These biological factors are significant for predicting the course of COVID-19. Some medical articles reach similar conclusions about test results, e.g. Hematologic, biochemical and immune biomarker abnormalities associated with severe illness and mortality in coronavirus disease 2019 (COVID-19): a meta-analysis, Clinical characteristics of coronavirus disease 2019 (COVID-19) in China: A systematic review and meta-analysis, or Thrombocytopenia is associated with severe coronavirus disease 2019 (COVID-19) infections: A meta-analysis.
tree_func <- function(final_model,
tree_num) {
#source: https://shiring.github.io/machine_learning/2017/03/16/rf_plot_ggraph
# get tree by index
tree <- randomForest::getTree(final_model,
k = tree_num,
labelVar = TRUE) %>%
tibble::rownames_to_column() %>%
# make leaf split points to NA, so the 0s won't get plotted
mutate(`split point` = ifelse(is.na(prediction), `split point`, NA))
# prepare data frame for graph
graph_frame <- data.frame(from = rep(tree$rowname, 2),
to = c(tree$`left daughter`, tree$`right daughter`))
# convert to graph and delete the last node that we don't want to plot
graph <- graph_from_data_frame(graph_frame) %>%
delete_vertices("0")
# set node labels
V(graph)$node_label <- gsub("_", " ", as.character(tree$`split var`))
V(graph)$leaf_label <- as.character(tree$prediction)
V(graph)$split <- as.character(round(tree$`split point`, digits = 2))
# plot
plot <- ggraph(graph, 'dendrogram') +
theme_bw() +
geom_edge_link() +
geom_node_point() +
geom_node_text(aes(label = node_label), na.rm = TRUE, repel = TRUE) +
geom_node_label(aes(label = split), vjust = 2.5, na.rm = TRUE, fill = "white") +
geom_node_label(aes(label = leaf_label, fill = leaf_label), na.rm = TRUE,
repel = TRUE, colour = "white", fontface = "bold", show.legend = FALSE) +
theme(panel.grid.minor = element_blank(),
panel.grid.major = element_blank(),
panel.background = element_blank(),
plot.background = element_rect(fill = "white"),
panel.border = element_blank(),
axis.line = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.title = element_text(size = 18))
print(plot)
}
Below is an example of a tree from the forest.
tree <- tree_func(final_model = new_rf_fit2$finalModel,
tree_num = 5)
The conclusions are as follows. The prediction model could be used to assess the state of health of patients. The model reflects reality sufficiently well, but the final model considers only blood test results; the patient's symptoms and other data such as age, gender and comorbidities are not taken into account. Above all, the model should play a supporting role in diagnosing the course of the disease; it should not prejudge the treatment of the patient. It should be remembered that each case is different and an individual course of treatment should be selected for each patient.
Aleksandra Chachowska, lek. Paweł Żmuda-Trzebiatowski lek. 2020. “Monocyty. Kiedy Warto Je Zbadać?” https://www.medonet.pl/zdrowie/zdrowie-dla-kazdego,monocyty---norma--podwyzszone--za-wysokie--powyzej-normy-,artykul,1655749.html.
Aleksandra Witkowska. 2020a. “Albuminy.” https://www.medonet.pl/badania,albuminy,artykul,1570119.html.
———. 2020b. “Gamma-Globuliny.” https://www.medonet.pl/badania,gamma-globuliny,artykul,1570125.html.
Andrzej Dębski, lek. Paweł Żmuda-Trzebiatowski. 2020. “Trombocyty. Płytki Krwi, Zaburzenia Krzepnięcia I Inne Schorzenia.” https://www.medonet.pl/zdrowie/zdrowie-dla-kazdego,plytki-krwi--trombocyty--plt---norma--niski-i-wysoki-poziom,artykul,1719033.html.
“AST, Aspat, Czyli Aminotransferaza Asparaginianowa - Normy.” 2020. https://www.poradnikzdrowie.pl/sprawdz-sie/badania/aspat-czyli-aminotransferaza-asparaginianowa-norm-aa-jJJ4-jnLP-6cmP.html.
“Badanie Alt – Norma I Wyniki. Co Znaczy Wysokie Alt?” 2020. https://www.medme.pl/artykuly/badanie-alt-badanie-krwi-w-diagnozie-uszkodzen-watroby-norma,67611.html.
Bartosz Domagała. 2020. “Badanie Gfr.” https://www.medonet.pl/zdrowie,badanie-gfr---co-to--przygotowanie--wynik--normy,artykul,1728109.html.
“Bazofile (Granulocyty Zasadochłonne) - Normy. Co Oznacza Wysoki Lub Niski Poziom Bazofili?” 2020. https://lekarzebezkolejki.pl/blog/bazofile-granulocyty-zasadochlonne-normy-co-oznacza-wysoki-lub-niski-poziom-bazofili/w-612.
Beata Wańczyk-Dręczewska. 2020a. “Bilirubina.” https://www.medonet.pl/badania,bilirubina-calkowita---norma-badania--podwyzszona-bilirubina,artykul,1570127.html.
———. 2020b. “Kwas Moczowy.” https://www.medonet.pl/badania,kwas-moczowy-we-krwi---norma-badania-w-surowicy,artykul,1570135.html.
“Bilirubina Bezpośrednia (Sprzężona) - Objawy, Leczenie I Badania.” 2020. https://www.doz.pl/zdrowie/h11103-Bilirubina_bezposrednia_sprzezona.
“Bilirubina Pośrednia Badanie.” 2020. https://www.synevo.pl/bilirubina-posrednia/.
Brandon Michael Henry, Stefanie Benoit, Maria Helena Santos de Oliveira. 2020. “Hematologic, Biochemical and Immune Biomarker Abnormalities Associated with Severe Illness and Mortality in Coronavirus Disease 2019 (Covid-19): A Meta-Analysis.” https://pubmed.ncbi.nlm.nih.gov/32286245/.
“Cholesterol Całkowity.” 2020. https://www.medicover.pl/badania/cholesterol-calkowity/#cholesterol_calkowity_norma.
“D-Dimery: Norma, Badanie I Wyniki. Przyczyny Podwyższonych?” 2020. https://www.medme.pl/artykuly/d-dimery-norma-badanie-i-wyniki-przyczyny-podwyzszonych,73373.html.
“Eozynofil.” 2020. https://pl.wikipedia.org/wiki/Eozynofil.
“Erytrocyty (Czerwona Krwinki) - Jaka Jest Ich Prawidłowa Ilość?” 2020. https://lekarzebezkolejki.pl/blog/erytrocyty-czerwona-krwinki-jaka-jest-ich-prawidlowa-ilosc/w-473.
Ewa Talarek. 2020a. “Badanie Stężenia Potasu W Surowicy Krwi.” https://www.mp.pl/pacjent/badania_zabiegi/174203,badanie-stezenia-potasu-w-surowicy-krwi.
———. 2020b. “Badanie Stężenia Sodu W Surowicy Krwi.” https://www.mp.pl/pacjent/badania_zabiegi/174208,badanie-stezenia-sodu-w-surowicy-krwi.
———. 2020c. “Badanie Stężenia Wapnia W Surowicy Krwi.” https://www.mp.pl/pacjent/badania_zabiegi/174223,badanie-stezenia-wapnia-w-surowicy-krwi.
“Ferrytyna - Normy I Znaczenie Dla Zdrowia.” 2020. https://lekarzebezkolejki.pl/blog/ferrytyna-normy-i-znaczenie-dla-zdrowia/w-963.
“Fosfataza Alkaliczna (Alp).” 2020. https://www.medicover.pl/badania/fosfataza-alkaliczna/.
Giuseppe Lippi, Brandon Michael Henry, Mario Plebani. 2020. “Thrombocytopenia Is Associated with Severe Coronavirus Disease 2019 (Covid-19) Infections: A Meta-Analysis.” https://pubmed.ncbi.nlm.nih.gov/32178975/.
“Glukoza.” 2020. https://www.medonet.pl/badania,glukoza-we-krwi---badanie-glukozy-w-surowicy--norma,artykul,1570132.html.
“HBsAg – Wykrywa Zakażenie Wzw Typu B.” 2020. https://www.medonet.pl/badania,hbsag---wykrywa-zakazenie-wzw-typu-b,artykul,1733281.html.
“HCV Rna Met. Real Time Rt- Pcr, Ilościowo.” 2020. https://diag.pl/sklep/badania/hcv-met-pcr-ilosciowo/.
“Hematokryt (Hct) – Normy. Co Oznacza Niski I Wysoki Poziom Hct?” 2020. https://lekarzebezkolejki.pl/blog/hematokryt-hct-normy-co-oznacza-niski-i-wysoki-poziom-hct/w-603.
“Hemoglobina - Jaki Jest Jej Prawidłowy Poziom?” 2020. https://lekarzebezkolejki.pl/blog/hemoglobina-jaki-jest-jej-prawidlowy-poziom/w-433.
“HIV-1 Rna Met. Real Time Rt-Pcr, Ilościowo.” 2020. https://diag.pl/sklep/badania/hiv-1-met-pcr-ilosciowo/.
“Interleukiny - Rodzaje, Funkcje, Rola W Powstawaniu Chorób Autoimmunologicznych.” 2020. https://centermed.pl/news/631,Interleukiny__rodzaje_funkcje_rola_w_powstawaniu_chorob_autoimmunologicznych.
Irina Bosek, Tomasz Miłek, Roman Kuczerowski. 2020. “Stężenie Interleukiny 2 I Interleukiny 10 U Pacjentów Z Cukrzycą Typu 2 I Rakiem Okrężnicy.” https://journals.viamedica.pl/diabetologia_praktyczna/article/download/58009/43758.
Janowska, Sara. 2020. “Limfocytoza - O Czym świadczy Zbyt Wysoki Poziom Limfocytów?” https://www.poradnikzdrowie.pl/sprawdz-sie/badania/limfocytoza-o-czym-swiadczy-zbyt-wysoki-poziom-limfocytow-aa-K94G-LDUH-GUH2.html.
Jarosz, Anna. 2020. “Badanie Inr - Wskazania, Wyniki, Normy.” https://www.poradnikzdrowie.pl/sprawdz-sie/badania/badanie-inr-wskazania-wyniki-normy-aa-gZ2i-NFtB-SSpt.html.
Jurewicz, Andrzej. 2020a. “Badanie Crp - Normy, Interpretacja Wyników. Podwyższone Crp a Diagnostyka Chorób.” https://www.medonet.pl/zdrowie/zdrowie-dla-kazdego,crp-badanie---norma-badania-krwi--podwyzszone--wysokie-crp-,artykul,1659270.html#hs-crp-badanie-crp-o-wysokiej-czulosci.
———. 2020b. “Mocznik, Azot Mocznikowy - Badanie, Normy. Obniżone I Podwyższone Stężenie Mocznika.” https://www.medonet.pl/badania,mocznik-we-krwi--bun----norma-badania,artykul,1570136.html.
Karolina Karabin. 2020a. “Białko Całkowite - Normy W Badaniu Biochemicznym.” https://www.poradnikzdrowie.pl/sprawdz-sie/badania/bialko-calkowite-normy-w-badaniu-biochemicznym-aa-tj5J-sRC7-u7oy.html.
———. 2020b. “Interleukina-6 - Badanie, Znaczenie I Normy.” https://www.poradnikzdrowie.pl/zdrowie/uklad-odpornosciowy/interleukina-6-badanie-znaczenie-i-normy-aa-MVWT-gjDc-7RUx.html.
Katarzyna Pawlikowska-Łagód, lek. Beata Wańczyk-Dręczewska. 2020a. “MCHC (średnie Stężenie Hemoglobiny W Krwince Czerwonej) - Normy, Nadmiar, Leczenie.” https://www.medonet.pl/badania,mchc---srednie-stezenie-hemoglobiny-w-krwince-czerwonej,artykul,1570107.html#mchc-ponizej-normy.
———. 2020b. “MCHC (średnie Stężenie Hemoglobiny W Krwince Czerwonej) - Normy, Nadmiar, Leczenie.” https://www.medonet.pl/badania,mchc---srednie-stezenie-hemoglobiny-w-krwince-czerwonej,artykul,1570107.html.
“Kiła (Treponema Pallidum), Vdrl, Monitorowanie Leczenia.” 2020. https://diag.pl/sklep/badania/kila-treponema-pallidum-vdrl/.
Klaudia Knap. 2020a. “Czas Protrombinowy I Inr.” https://www.mp.pl/pacjent/badania_zabiegi/152270,czas-protrombinowy-i-inr.
———. 2020b. “Troponiny Sercowe.” https://www.mp.pl/pacjent/badania_zabiegi/152291,troponiny-sercowe.
Kornel Gajewski. 2020a. “Chlorki.” https://www.mp.pl/pacjent/badania_zabiegi/171833,chlorki.
———. 2020b. “Dehydrogenaza Mleczanowa (Ldh).” https://www.mp.pl/pacjent/badania_zabiegi/171840,dehydrogenaza-mleczanowa-ldh.
———. 2020c. “Gamma-Glutamylotranspeptydaza (Ggtp).” https://www.mp.pl/pacjent/badania_zabiegi/137074,gamma-glutamylotranspeptydaza-ggtp.
“Kreatynina - Czym Jest? Badanie Kreatyniny I Normy.” 2020. https://lekarzebezkolejki.pl/blog/kreatynina-czym-jest-badanie-kreatyniny-i-normy/w-1328.
Leiwen Fu, Tanwei Yuan, Bingyi Wang. 2020. “Clinical Characteristics of Coronavirus Disease 2019 (Covid-19) in China: A Systematic Review and Meta-Analysis.” https://pubmed.ncbi.nlm.nih.gov/32283155/.
Li Yan, [...], Hai-Tao Zhang. 2020. “An Interpretable Mortality Prediction Model for Covid-19 Patients.” Nature Machine Intelligence. https://www.nature.com/articles/s42256-020-0180-7.
Majewska, Monika. 2020a. “Neutrofile: Normy. Podwyższony, Obniżony Poziom Neut.” https://www.poradnikzdrowie.pl/sprawdz-sie/badania/neutrofile-normy-podwyzszony-obnizony-poziom-neut-aa-EzVK-Qfpg-QhBs.html.
———. 2020b. “Prokalcytonina (Pct) - Normy I Wyniki Badania.” https://www.poradnikzdrowie.pl/sprawdz-sie/badania/prokalcytonina-pct-normy-i-wyniki-badania-aa-5gM9-PEpJ-vpji.html.
Basiura, Mariusz. 2020. "Eozynofile (eozynocyty, EOS): podwyższone lub poniżej normy" [Eosinophils (EOS): elevated or below normal]. https://www.poradnikzdrowie.pl/sprawdz-sie/badania/eozynofile-eozynocyty-eos-podwyzszone-lub-ponizej-normy-aa-sAcV-QXWn-EJti.html.
Dąbrowska, Monika. 2020. "Limfocyty – czym są, jakie są normy, co oznacza spadek poziomu limfocytów" [Lymphocytes: what they are, normal ranges, and what a low count means]. https://enel.pl/enelzdrowie/zdrowie/limfocyty-czym-sa-jakie-sa-normy-co-oznacza-spadek-poziomu-limfocytow.
Kostyńska, Marlena, and Aleksandra Witkowska. 2020. "Leukocyty. Kiedy wykonać badanie? O czym świadczą wyniki?" [Leukocytes: when to test and what the results indicate]. https://www.medonet.pl/badania,leukocyty---rodzaje--badanie-krwi--normy--obecnosc-w-moczu,artykul,1570110.html.
Kostyńska, Marlena, and Beata Wańczyk-Dręczewska. 2020. "Czas kaolinowo-kefalinowy (APTT). Pomiar czasu częściowej tromboplastyny po aktywacji" [Activated partial thromboplastin time (APTT)]. https://www.medonet.pl/badania,czas-kaolinowo-kefalinowy--czas-czesciowej-tromboplastyny-po-aktywacji--ang--activated-partial-thromboplastin-time--aptt-,artykul,1570157.html.
Kostyńska, Marlena, and Paweł Żmuda-Trzebiatowski. 2020a. "Czas trombinowy (TT) – krzepnięcie krwi, przebieg badania, odstępstwa od normy" [Thrombin time (TT): blood clotting, test procedure, deviations from normal]. https://www.medonet.pl/badania,czas-trombinowy--tt----krzepniecie-krwi--przebieg-badania--odstepstwa-od-normy,artykul,1734547.html.
———. 2020b. "Fibrynogen – kiedy oznaczamy, przebieg badania, odstępstwa od normy" [Fibrinogen: when it is measured, test procedure, deviations from normal]. https://www.medonet.pl/badania,fibrynogen---kiedy-oznaczamy--przebieg-badania--odstepstwa-od-normy,artykul,1570274.html.
Łagódka, Zuzanna. 2020. "Gazometria – czym jest to badanie?" [Blood gas analysis: what is this test?]. https://enel.pl/enelzdrowie/zdrowie/gazometria-czym-jest-to-badanie.
Mazur, Justyna. 2020. "TNF alfa (czynnik martwicy nowotworu) – badanie, norma, leczenie" [TNF-alpha (tumor necrosis factor): test, normal range, treatment]. https://wylecz.to/badania-laboratoryjne/tnf-alfa-czynnik-martwicy-nowotworu-badanie-norma-leczenie/.
"MCV (średnia objętość erytrocytów) – normy, wysoki i niski poziom" [MCV (mean corpuscular volume): normal range, high and low levels]. 2020. https://lekarzebezkolejki.pl/blog/mcv-srednia-objetosc-erytrocytow-normy-wysoki-i-niski-poziom/w-1327.
Morzy, Tadeusz. 2013. Eksploracja danych. Metody i algorytmy [Data mining: methods and algorithms].
Nęcki, Tomasz. 2020a. "Monocyty (MONO) – rola, norma, nadmiar i niedobór" [Monocytes (MONO): role, normal range, excess and deficiency]. https://www.poradnikzdrowie.pl/sprawdz-sie/badania/monocyty-mono-rola-norma-nadmiar-i-niedobor-aa-FvLD-BrBi-DChY.html.
———. 2020b. "Wskaźnik RDW-SD (wskaźnik rozkładu objętości erytrocytów) – normy" [RDW-SD (red cell distribution width): normal ranges]. https://www.poradnikzdrowie.pl/sprawdz-sie/badania/wskaznik-rdw-sd-wskaznik-rozkladu-objetosci-erytrocytow-normy-aa-9gN7-txfr-Ahbo.html.
"Neutrofile poniżej normy. Jakie są przyczyny?" [Neutrophils below normal: what are the causes?]. 2020. https://lekarzebezkolejki.pl/blog/neutrofile-ponizej-normy-jakie-sa-przyczyny/w-614.
Nowacka, Mirosława. 2020. "Wapń całkowity, zjonizowany i skorygowany" [Total, ionized and corrected calcium]. http://odiagnostyce.pl/2016/08/20/wapn-calkowity-zjonizowany-i-skorygowany.
"NT-proBNP – charakterystyka, wskazania do badania, przeprowadzenie badania, norma, interpretacja wyników" [NT-proBNP: characteristics, indications, test procedure, normal range, interpretation of results]. 2020. https://portal.abczdrowie.pl/nt-pro-bnp-charakterystyka-wskazania-do-badania-przeprowadzenie-badania-norma-interpretacja-wynikow.
Paculanka, Agnieszka. 2020. "OB, czyli odczyn Biernackiego. Jakie są normy OB?" [ESR (erythrocyte sedimentation rate): what are the normal ranges?]. https://www.poradnikzdrowie.pl/sprawdz-sie/badania/ob-odczyn-biernackiego-jakie-sa-normy-ob-aa-ew6R-HroV-s9b2.html.
Pawlak, Marta, and Aleksandra Witkowska. 2020a. "Antytrombina – kiedy sprawdzić jej aktywność i stężenie?" [Antithrombin: when to check its activity and concentration?]. https://www.medonet.pl/badania,antytrombina---kiedy-sprawdzic-jej-aktywnosc-i-stezenie-,artykul,1734857.html.
———. 2020b. "RDW-SD – oznaczanie, wskazania, norma, analiza wyników" [RDW-SD: testing, indications, normal range, interpretation of results]. https://www.medonet.pl/zdrowie,rdw-sd---oznaczanie--wskazania--norma--analiza-wynikow,artykul,1733018.html.
"PDW (morfologia krwi) – norma, podwyższone i obniżone PDW" [PDW (complete blood count): normal, elevated and reduced PDW]. 2020. https://lekarzebezkolejki.pl/blog/pdw-morfologia-krwi-norma-podwyzszone-i-obnizone-pdw/w-1452.
Podwysocki, Piotr. 2020. "MPV – norma, podwyższone, poniżej normy" [MPV: normal range, elevated, below normal]. https://www.poradnikzdrowie.pl/sprawdz-sie/badania/mpv-morfologia-norma-podwyzszone-ponizej-normy-aa-Mdz2-jMCL-HTi5.html.
"Produkty degradacji fibrynogenu/fibryny (FDP, ang. fibrin/fibrinogen degradation products)" [Fibrin/fibrinogen degradation products (FDP)]. 2020. https://www.medonet.pl/badania,produkty-degradacji-fibrynogenu-fibryny--fdp--ang--fibrin-fibrinogen-degradation-products-,artykul,1570240.html#interpretacja-wynikow-badan.
Socha-Chyrzyński, Natasza. 2020. "Gazometria: wyniki i normy pH krwi" [Blood gas analysis: results and blood pH norms]. https://www.poradnikzdrowie.pl/sprawdz-sie/badania/gazometria-wyniki-i-normy-ph-krwi-aa-6EWC-V3wE-QRN5.html.
"Thrombocytocrit." 2020. https://medical-dictionary.thefreedictionary.com/thrombocytocrit.
Wasilonek, Monika, and Aleksandra Witkowska. 2020. "Badanie P-LCR – wskazania do wykonania i normy" [P-LCR test: indications and normal ranges]. https://www.medonet.pl/zdrowie,badanie-p-lcr---wskazania-do-wykonania-i-normy,artykul,1733026.html#badanie-p-lcr-normy.